freely available online: R for Data Science
special-purpose programming language for data science and statistical computing
authority says to tell you to not think of R as a programming language!
think of it as a tool optimized for creating scripts to manipulate, plot and analyze data
a lot of innovation and development takes place in packages
go browse some 12,000 packages on CRAN
install packages (only once)
install.packages('tidyverse')
load packages (for every session)
library(tidyverse)
base R functionality is always available
x = seq(from = 1, to = 10, length.out = 1000)
plot(x,x^2)
packages bring extra functions
library(ggplot2)
ggplot2::qplot(x,x^2)
integrated development environment for R
for all base R stuff, check the R manual
6 * 7
## [1] 42
x = c(1,2,3)
x + 1
## [1] 2 3 4
supports object-oriented, procedural & functional styles
convenient interfaces to other languages
assignment in both directions possible
x <- 3
3 -> y
x == y
## [1] TRUE
help('qplot')
qplot {ggplot2} R Documentation
Quick plot
Description
qplot is a shortcut designed to be familiar if you're used to base plot(). It's a convenient
wrapper for creating a number of different types of plots using a consistent calling scheme.
It's great for allowing you to produce plots quickly, but I highly recommend learning ggplot()
as it makes it easier to create complex graphics.
Usage
qplot(x, y = NULL, ..., data, facets = NULL, margins = FALSE,
geom = "auto", xlim = c(NA, NA), ylim = c(NA, NA), log = "",
main = NULL, xlab = deparse(substitute(x)),
ylab = deparse(substitute(y)), asp = NA, stat = NULL, position = NULL)
typeof(2)
## [1] "double"
c()
x = c(10,20,30)
x
## [1] 10 20 30
c(length(200), length("huhu"))
## [1] 1 1
x[2]
## [1] 20
m = matrix(c(1,2,3,4,5,6), nrow = 2)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
m[1,]
## [1] 1 3 5
m %*% x ## dot product
##      [,1]
## [1,]  220
## [2,]  280
typeof("huhu")
## [1] "character"
chr.vector = c("huhu", "hello", "huhu", "ciao")
chr.vector
## [1] "huhu"  "hello" "huhu"  "ciao"
factor(chr.vector)
## [1] huhu  hello huhu  ciao
## Levels: ciao hello huhu
factor(chr.vector, ordered = T,
       levels = c("huhu", "ciao", "hello"))
## [1] huhu  hello huhu  ciao
## Levels: huhu < ciao < hello
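A minimal sketch of what the level order buys you, reusing chr.vector from above. The variable names f and o are illustrative and not part of the original slides.

```r
# Same values as chr.vector in the slides
chr.vector <- c("huhu", "hello", "huhu", "ciao")

# Unordered factor: levels default to alphabetical order
f <- factor(chr.vector)
levels(f)        # the distinct values, sorted
as.integer(f)    # integer codes that index into levels(f)

# Ordered factor with an explicit level order
o <- factor(chr.vector, ordered = TRUE,
            levels = c("huhu", "ciao", "hello"))

# Comparisons follow the declared level order, not the alphabet
o[2] > o[4]      # "hello" ranks above "ciao" in this ordering

# Counts are reported in level order
table(o)
```

Comparison operators such as > are only meaningful for ordered factors; on an unordered factor they yield NA with a warning.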
my.list = list(dudu = 1,
               chacha = c("huhu", "ciao"))
exp.data = data.frame(trial = 1:5,
                      condition = factor(c("C1", "C2", "C1",
                                           "C3", "C2"),
                                         ordered = T),
                      response = c(121, 133, 119, 102, 156))
exp.data
##   trial condition response
## 1     1        C1      121
## 2     2        C2      133
## 3     3        C1      119
## 4     4        C3      102
## 5     5        C2      156
exp.data$condition
## [1] C1 C2 C1 C3 C2
## Levels: C1 < C2 < C3
exp.data[3,]
##   trial condition response
## 3     3        C1      119
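A short sketch of common ways to reach into lists and data frames, using the objects defined above. The helper names c2.rows and c2.mean are made up for illustration.

```r
my.list <- list(dudu = 1,
                chacha = c("huhu", "ciao"))

exp.data <- data.frame(
  trial = 1:5,
  condition = factor(c("C1", "C2", "C1", "C3", "C2"), ordered = TRUE),
  response = c(121, 133, 119, 102, 156)
)

# List elements by name: $ and [[ ]] return the element itself
my.list$chacha
my.list[["chacha"]][2]        # second entry of that element

# A data frame is a list of equal-length columns, so the same syntax works
exp.data$response
exp.data[["response"]]

# Logical indexing selects rows; the empty slot after the comma keeps all columns
c2.rows <- exp.data[exp.data$condition == "C2", ]

# Mean response in condition C2: (133 + 156) / 2
c2.mean <- mean(c2.rows$response)
c2.mean
```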
as_tibble(exp.data)
## # A tibble: 5 x 3
##   trial condition response
##   <int> <ord>        <dbl>
## 1     1 C1             121
## 2     2 C2             133
## 3     3 C1             119
## 4     4 C3             102
## 5     5 C2             156
exp.data
##   trial condition response
## 1     1        C1      121
## 2     2        C2      133
## 3     3        C1      119
## 4     4        C3      102
## 5     5        C2      156
my.tibble = tibble(x = 1:10, y = x^2) ## dynamic construction possible
my.dataframe = data.frame(x = 1:10, y = x^2) ## ERROR :/
mydist is associated with four functions:
dmydist(x, ...) gives the probability (mass/density) \(f(x)\) for x
pmydist(x, ...) gives the cumulative distribution function \(F(x)\) for x
qmydist(p, ...) gives the value \(x\) for which p = pmydist(x, ...)
rmydist(n, ...) returns n samples from the distribution
x = seq(-5, 5, length.out = 1000)
y = dnorm(x, mean = 1, sd = 0.5)
plot(x,y)
data = tibble(IQ = c(100,110,120,125),
              RT = c(67,58,98,80) )
map_dbl(data, mean)
##     IQ     RT
## 113.75  75.75
tibble(IQ = c(100,110,120,125),
       RT = c(67,58,98,80) ) %>%
  map_dbl(mean)
##     IQ     RT
## 113.75  75.75
crazy.operation = function(x,y) {
  x+y
}
crazy.operation(2,3)
## [1] 5
tibble(IQ = c(100,110,120,125),
       RT = c(67,58,98,80) ) %>%
  map_dbl(function(i) {max(i)-min(i)})
## IQ RT
## 25 40
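Coming back to the d/p/q/r naming scheme for distributions: the sketch below runs through all four functions for the normal distribution, with the same mean and sd as in the density plot above. The names mu, sigma, x0, p0 and samples, the seed and the sample size are illustrative choices.

```r
# Parameters of the normal distribution used in the density plot
mu <- 1
sigma <- 0.5

# d: density at a point (here, at the mean)
dnorm(mu, mean = mu, sd = sigma)

# p: cumulative probability; half of the mass lies below the mean
pnorm(mu, mean = mu, sd = sigma)

# q: quantile function; the 50% quantile of a normal is its mean
qnorm(0.5, mean = mu, sd = sigma)

# q inverts p (up to floating-point error)
x0 <- 1.7
p0 <- pnorm(x0, mean = mu, sd = sigma)
all.equal(qnorm(p0, mean = mu, sd = sigma), x0)

# r: random samples; fix the seed to make the draw reproducible
set.seed(123)
samples <- rnorm(10000, mean = mu, sd = sigma)
mean(samples)   # should land close to mu
sd(samples)     # should land close to sigma
```

The same pattern holds for other built-in distributions, for example dbinom/pbinom/qbinom/rbinom or dunif/punif/qunif/runif.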
data from experimental (psych) studies is usually rectangular data
examples of (usually) non-rectangular data:
the tidyverse is particularly efficient for dealing with tidy rectangular data
library(nycflights13)
nycflights13::flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
study Chapters 5 and 12 from R for Data Science
this is untidy if we want to analyze/plot grade as a function of exam type
grades = tibble(name = c('Michael', 'Noa', 'MadEye'),
                midterm = c(3.7, 1.0, 1.3),
                final = c(4.0, 1.3, 1.0))
grades
## # A tibble: 3 x 3
##   name    midterm final
##   <chr>     <dbl> <dbl>
## 1 Michael     3.7   4
## 2 Noa         1     1.3
## 3 MadEye      1.3   1
to tidy up, we need to gather columns which are not separate variables into a new column
grades %>% gather('midterm', 'final',
                  key = 'exam', value = 'grade')
## # A tibble: 6 x 3
##   name    exam    grade
##   <chr>   <chr>   <dbl>
## 1 Michael midterm   3.7
## 2 Noa     midterm   1
## 3 MadEye  midterm   1.3
## 4 Michael final     4
## 5 Noa     final     1.3
## 6 MadEye  final     1
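In tidyr 1.0.0 and later, gather() still works but is superseded by pivot_longer(). A sketch of the same reshaping, assuming a recent tidyr is installed; the name grades_long is illustrative.

```r
library(tibble)
library(tidyr)

grades <- tibble(name = c("Michael", "Noa", "MadEye"),
                 midterm = c(3.7, 1.0, 1.3),
                 final = c(4.0, 1.3, 1.0))

# cols:      the columns that hold values of one underlying variable
# names_to:  new column that receives the old column names
# values_to: new column that receives the cell values
grades_long <- pivot_longer(grades,
                            cols = c(midterm, final),
                            names_to = "exam",
                            values_to = "grade")
grades_long
```

The result contains the same six name/exam/grade combinations as the gather() call, though the rows may come out in a different order (grouped by student rather than by exam).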
this is untidy if we want to analyze grade as a function of participation
results = tibble(name = c('Michael', 'Noa', 'MadEye',
                          'Michael', 'Noa', 'MadEye'),
                 what = rep(c('grade', 'participation'),
                            each = 3),
                 howmuch = c(3.7, 1.0, 1.0, 55, 100, 100))
results
## # A tibble: 6 x 3
##   name    what          howmuch
##   <chr>   <chr>           <dbl>
## 1 Michael grade             3.7
## 2 Noa     grade             1
## 3 MadEye  grade             1
## 4 Michael participation    55
## 5 Noa     participation   100
## 6 MadEye  participation   100
to tidy up, we need to spread cells from a row out over several columns
results %>% spread(key = 'what', value = 'howmuch')
## # A tibble: 3 x 3
##   name    grade participation
##   <chr>   <dbl>         <dbl>
## 1 MadEye    1             100
## 2 Michael   3.7            55
## 3 Noa       1             100
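The counterpart of spread() in tidyr 1.0.0 and later is pivot_wider(). A sketch under the same assumption of a recent tidyr; the name results_wide is illustrative.

```r
library(tibble)
library(tidyr)

results <- tibble(name = c("Michael", "Noa", "MadEye",
                           "Michael", "Noa", "MadEye"),
                  what = rep(c("grade", "participation"), each = 3),
                  howmuch = c(3.7, 1.0, 1.0, 55, 100, 100))

# names_from:  column whose values become the new column names
# values_from: column whose values fill the new columns
results_wide <- pivot_wider(results,
                            names_from = what,
                            values_from = howmuch)
results_wide

# With one row per student, grade and participation can be related directly
results_wide$grade[results_wide$participation == 100]
```

Row order may differ from the spread() output above, which sorts by name.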
for background see Wickham (2010)
fully explicit
ggplot() +
  layer(
    data = diamonds,
    mapping = aes(x = carat, y = price),
    geom = "point",
    stat = "identity",
    position = "identity"
  ) +
  scale_x_continuous() +
  scale_y_continuous() +
  coord_cartesian()
with syntactic sugar and defaults
diamonds %>% ggplot(aes(carat, price)) + geom_point()
ggplot call from the cheat sheet
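Because every component is added with +, a plot can be stored in a variable and extended step by step. A minimal sketch; the alpha value, the log scale, the labels and the names p and p_points are illustrative choices that do not come from the slides.

```r
library(ggplot2)

# Data and aesthetic mapping only: no geom yet, so nothing would be drawn
p <- ggplot(diamonds, aes(x = carat, y = price))

# Add one layer, replace the default y scale, and set axis labels
p_points <- p +
  geom_point(alpha = 0.1) +      # semi-transparent points reduce overplotting
  scale_y_log10() +              # overrides the default continuous y scale
  labs(x = "carat", y = "price (log scale)")

# In an interactive session, printing the object renders the plot:
# print(p_points)
```

Components left unspecified (stat, position, coordinate system) fall back to the same defaults that the fully explicit version spells out.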
prepare, analyze & plot data right inside your document
export to a variety of different formats
headers & sections
# header 1
## header 2
### header 3
emphasis, highlighting etc.
*italics* or _italics_
**bold** or __bold__
~~strikeout~~
links
[link](https://www.google.com)
inline code & code blocks
`function(x) return(x - 1)`
extension of markdown to dynamically integrate R output
multiple output formats:
cheat sheet and a quick tour
inline equations with $\theta$
equation blocks with
$$ \begin{align*} E &= mc^2 \\
& = \text{a really smart formula}
\end{align*} $$
caveat
LaTeX-style formulas will be rendered differently depending on the output method: